Multi-Dialect Vietnamese: Task, Dataset, Baseline Models and Challenges
Van Dinh, Nguyen, Dang, Thanh Chi, Nguyen, Luan Thanh, Van Nguyen, Kiet
Vietnamese, a low-resource language, is typically categorized into three primary dialect groups corresponding to Northern, Central, and Southern Vietnam. However, each province within these regions exhibits its own distinct pronunciation variations. Despite the existence of various speech recognition datasets, none provides a fine-grained classification of the 63 dialects specific to individual provinces of Vietnam. To address this gap, we introduce the Vietnamese Multi-Dialect (ViMD) dataset, a novel comprehensive dataset capturing the rich diversity of the 63 provincial dialects spoken across Vietnam. Our dataset comprises 102.56 hours of audio, consisting of approximately 19,000 utterances, and the associated transcripts contain over 1.2 million words. To provide benchmarks and simultaneously demonstrate the challenges of our dataset, we fine-tune state-of-the-art pre-trained models for two downstream tasks: (1) dialect identification and (2) speech recognition. The empirical results suggest two implications: the influence of geographical factors on dialects, and the constraints of current approaches in speech recognition tasks involving multi-dialect speech data. Our dataset is available for research purposes.
- Asia > Vietnam > Hanoi > Hanoi (0.14)
- Asia > Vietnam > Thanh Hóa Province > Thanh Hóa (0.04)
- Asia > Vietnam > Hưng Yên Province > Hưng Yên (0.04)
- (65 more...)
Crossing Linguistic Horizons: Finetuning and Comprehensive Evaluation of Vietnamese Large Language Models
Truong, Sang T., Nguyen, Duc Q., Nguyen, Toan, Le, Dong D., Truong, Nhi N., Quan, Tho, Koyejo, Sanmi
Large language models (LLMs) such as GPT-4 (OpenAI, 2023), BLOOM (Le Scao et al., 2023), LLaMa-2 (Touvron et al., 2023), Mistral (Jiang et al., 2023), Mixtral (Jiang et al., 2024), and Gemma (Team et al., 2024) have made significant contributions to the field of natural language processing (NLP). Despite their advancements, a gap remains in their specialization for many languages, including Vietnamese. This paper addresses the development and evaluation of Vietnamese-centric LLMs. Vietnam, with a population surpassing 100 million, ranks as the 16th most populous country globally. We employ fine-tuning on LLaMa-2, Mixtral 8x7B, and Gemma, and conduct a comprehensive evaluation of Vietnamese LLMs across various scenarios and settings. Throughout the thorough evaluation process, we observe the following: (i) larger language models exhibit unseen capabilities compared to smaller counterparts; (ii) larger language models tend to manifest more biases, produce uncalibrated results, and are more susceptible to the influence of input prompts; (iii) the quality of training or fine-tuning datasets is the key for unlocking LLM performance. Our key contributions include:
- Asia > Middle East > Qatar (0.27)
- Europe > Norway (0.14)
- Asia > Middle East > Kuwait (0.14)
- (100 more...)
- Government (1.00)
- Education (1.00)
- Health & Medicine (0.67)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Performance Analysis > Accuracy (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Fibonacci and k-Subsecting Recursive Feature Elimination
Feature selection is a data mining task with the potential of speeding up classification algorithms, enhancing model comprehensibility, and improving learning accuracy. However, finding a subset of features that is optimal in terms of predictive accuracy is usually computationally intractable. Out of several heuristic approaches to dealing with this problem, the Recursive Feature Elimination (RFE) algorithm has received considerable interest from data mining practitioners. In this paper, we propose two novel algorithms inspired by RFE, called Fibonacci- and k-Subsecting Recursive Feature Elimination, which remove features in logarithmic steps, probing the wrapped classifier more densely for the more promising feature subsets. The proposed algorithms are experimentally compared against RFE on 28 highly multidimensional datasets and evaluated in a practical case study involving 3D electron density maps from the Protein Data Bank. The results show that Fibonacci and k-Subsecting Recursive Feature Elimination are capable of selecting a smaller subset of features much faster than standard RFE, while achieving comparable predictive performance.
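The Fibonacci-step elimination idea described in the abstract can be sketched in a few lines. The following is an illustrative reconstruction, not the authors' implementation: the `importance` callable is a hypothetical stand-in for the feature weights a wrapped classifier would provide, and only the "remove features in decreasing Fibonacci-sized steps instead of one at a time" idea is shown.

```python
def fibonacci_steps(n):
    """Distinct Fibonacci numbers not exceeding n, largest first."""
    fibs = [1, 1]
    while fibs[-1] + fibs[-2] <= n:
        fibs.append(fibs[-1] + fibs[-2])
    return sorted(set(fibs), reverse=True)

def fib_rfe(features, importance, target):
    """Recursive feature elimination with Fibonacci-sized steps.

    Instead of dropping one feature per round (standard RFE), each round
    drops the largest Fibonacci-sized batch of least-important features
    that still leaves at least `target` features, so the feature count
    shrinks in logarithmically many rounds.
    """
    remaining = list(features)
    while len(remaining) > target:
        # Largest Fibonacci step that does not overshoot the target.
        step = next(s for s in fibonacci_steps(len(remaining))
                    if len(remaining) - s >= target)
        remaining.sort(key=importance, reverse=True)  # most important first
        remaining = remaining[:len(remaining) - step]  # drop worst `step`
    return remaining
```

For example, with ten features ranked by index and a target of three, the sketch eliminates five features, then two, then reaches the target in two rounds rather than seven single-feature rounds; in the real algorithm the importance ranking would be re-estimated by retraining the wrapped classifier after each batch.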
- Europe > Poland > Greater Poland Province > Poznań (0.05)
- Europe > United Kingdom > England > Cambridgeshire > Cambridge (0.04)
- Asia > Vietnam > Kiên Giang Province > Rạch Giá (0.04)
- Research Report > Experimental Study (0.68)
- Research Report > New Finding (0.66)